{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 准备工作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "阿里研究院\n",
    "阿里健康\n",
    "阿里巴巴商学院\n",
    "阿里数据\n",
    "小北\n",
    "softime\n",
    "\n",
    "腾讯金融科技\n",
    "腾讯研究院\n",
    "腾讯媒体研究院\n",
    "腾讯云启研究院\n",
    "酷鹅用户研究院\n",
    "'''\n",
    "公众号 = \"softime\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "fn = { \"output\" : { \"公众号_htm_snippets\": \"data_raw_src/公众号_htm_snippets_{公众号}.tsv\",\n",
    "                    \"公众号_df\": \"data_raw_src/公众号_df_{公众号}.tsv\",\n",
    "                    \"公众号_xlsx\": \"data_sets/公众号_url_{公众号}.xlsx\" } \\\n",
    "      }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 采集公众号（requests）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 采集公众号（selenium）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from lxml.html import fromstring\n",
    "import time\n",
    "from random import random\n",
    "\n",
    "# when selenium main_content is used\n",
    "# Parses an HTML document from a string constant.  Returns the root nood\n",
    "# root = fromstring(df.loc[1,\"html_snippets\"]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用Selenium\n",
    "* 要更改 opts.binary_location 至自己本地的Chrome浏览器，建议portable\n",
    "* Chrome浏览器 和 chromedriver.exe要同版本号到小数后一位\n",
    "* 要确保可以 开启浏览器机器人\n",
    "* 要确保浏览器机器人 可以打开网页 driver.get(\"https://mp.weixin.qq.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\62633\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:18: DeprecationWarning: use options instead of chrome_options\n"
     ]
    }
   ],
   "source": [
    "from selenium import webdriver\n",
    "from selenium.webdriver.common.desired_capabilities import DesiredCapabilities\n",
    "\n",
    "#caps=dict()\n",
    "#caps[\"pageLoadStrategy\"] = \"none\"   # Do not wait for full page load\n",
    "\n",
    "opts = webdriver.ChromeOptions()\n",
    "opts.add_argument('--no-sandbox')#解决DevToolsActivePort文件不存在的报错\n",
    "opts.add_argument('window-size=1920x3000') #指定浏览器分辨率\n",
    "opts.add_argument('--disable-gpu') #谷歌文档提到需要加上一这个属性来规避bug\n",
    "opts.add_argument('--hide-scrollbars') #隐藏滚动条, 应对些特殊页面\n",
    "#opts.add_argument('blink-settings=imagesEnabled=false') #不加载图片, 提升速度\n",
    "#opts.add_argument('--headless') #浏览器不提供可视化页面. linux下如果系统不支持可视化不加这条会启动失败\n",
    "\n",
    "opts.binary_location = r\"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe\" #\"H:\\_coding_\\Gitee\\InternetNewMedia\\CapstonePrj2016\\chromedriver.exe\"  \n",
    "\n",
    "# \"H:\\_coding_\\Gitee\\InternetNewMedia\\CapstonePrj2016\\chromedriver.exe\"  \n",
    "driver = webdriver.Chrome( chrome_options = opts) #desired_capabilities=caps,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.get(\"https://mp.weixin.qq.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 填表登入"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "selenium 的定位方法\n",
    "* find_element_by_id &ensp;&ensp;&ensp;  根据标签id定位\n",
    "* find_element_by_name   &ensp;&ensp;&ensp; 根据标签的name定位\n",
    "* find_element_by_xpath  &ensp;&ensp;&ensp; 根据xpath定位\n",
    "* find_element_by_link_text  &ensp;&ensp;&ensp; 通过文字链接来定位元素\n",
    "* find_element_by_partial_link_text  &ensp;&ensp;&ensp;  通过文字链接来定位元素\n",
    "* find_element_by_tag_name  &ensp;&ensp;&ensp;  根据标签的名字定位\n",
    "* find_element_by_class_name  &ensp;&ensp;&ensp; 通过class name 定位\n",
    "* find_element_by_css_selector  &ensp;&ensp;&ensp;  根据元素属性来定位"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "payload =  {\"account\": \"626331464@qq.com\", \"password\": \"13026759362.@\"}\n",
    "# payload =  {\"account\": \"NFUHacks@163.com\", \"password\": \"NFU706947580\"}\n",
    "driver.find_element_by_xpath('//div[@class=\"login__type__container login__type__container__scan\"]/a').click()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "WebDriver 常用方法：\n",
    "* clear()清楚文本\n",
    "* send_keys(values)模拟按键输入\n",
    "* click()模拟点击\n",
    "* submit模拟提交"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"account\"]').clear()\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"account\"]').send_keys(payload['account'])\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"password\"]').clear()\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"password\"]').send_keys(payload['password'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.find_element_by_xpath('//div[@class=\"login_btn_panel\"]/a').click()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 点选单"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其他常用方法\n",
    "* size：返回元素的尺寸\n",
    "* text：获取元素的文本\n",
    "* get_attribute：获取属性值  &ensp;&ensp;&ensp; get_attribute('innerHTML')获取元素内的全部HTML\n",
    "* is_displayed()：设置该元素用户是否可见"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "ename": "ElementNotInteractableException",
     "evalue": "Message: element not interactable\n  (Session info: chrome=81.0.4044.138)\n",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mElementNotInteractableException\u001b[0m           Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-48-b8247c7b9ef1>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[0melement\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdriver\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfind_element_by_xpath\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'//a[@id=\"m_open\"]'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0melement\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclick\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      3\u001b[0m \u001b[0mmain_content\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0melement\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_attribute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'innerHTML'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      4\u001b[0m \u001b[0mmain_content\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\webelement.py\u001b[0m in \u001b[0;36mclick\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m     78\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0mclick\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     79\u001b[0m         \u001b[1;34m\"\"\"Clicks the element.\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 80\u001b[1;33m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_execute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mCommand\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mCLICK_ELEMENT\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     81\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     82\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0msubmit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\webelement.py\u001b[0m in \u001b[0;36m_execute\u001b[1;34m(self, command, params)\u001b[0m\n\u001b[0;32m    631\u001b[0m             \u001b[0mparams\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    632\u001b[0m         \u001b[0mparams\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'id'\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_id\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 633\u001b[1;33m         \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_parent\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    634\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    635\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0mfind_element\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mby\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mBy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mID\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\webdriver.py\u001b[0m in \u001b[0;36mexecute\u001b[1;34m(self, driver_command, params)\u001b[0m\n\u001b[0;32m    319\u001b[0m         \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcommand_executor\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdriver_command\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    320\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 321\u001b[1;33m             \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0merror_handler\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcheck_response\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresponse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    322\u001b[0m             response['value'] = self._unwrap_value(\n\u001b[0;32m    323\u001b[0m                 response.get('value', None))\n",
      "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\errorhandler.py\u001b[0m in \u001b[0;36mcheck_response\u001b[1;34m(self, response)\u001b[0m\n\u001b[0;32m    240\u001b[0m                 \u001b[0malert_text\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'alert'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'text'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    241\u001b[0m             \u001b[1;32mraise\u001b[0m \u001b[0mexception_class\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mscreen\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstacktrace\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0malert_text\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 242\u001b[1;33m         \u001b[1;32mraise\u001b[0m \u001b[0mexception_class\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mscreen\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstacktrace\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    243\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    244\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0m_value_or_default\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mobj\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdefault\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mElementNotInteractableException\u001b[0m: Message: element not interactable\n  (Session info: chrome=81.0.4044.138)\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//a[@id=\"m_open\"]')\n",
    "element.click()\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.execute_script(\"window.scrollTo(0,document.body.scrollHeight)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://mp.weixin.qq.com/cgi-bin/appmsg?begin=0&count=10&t=media/appmsg_list&type=10&action=list&token=1352791831&lang=zh_CN'"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//li[@title[contains(.,\"素材管理\")]]/a') \n",
    "# main_content = element.get_attribute('innerHTML')\n",
    "# main_content\n",
    "url2= element.get_attribute(\"href\")\n",
    "url2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.get(url2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 新建图文消息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"新建图文消息\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "main_content\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['CDwindow-5BD98590E94F467BC61398FADFB0E12C', 'CDwindow-9170EA9ED85277E0A84D3241F1DFCFD4']\n"
     ]
    }
   ],
   "source": [
    "print (driver.window_handles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 新建图文消息开了另一分视窗，所以要切换 switch_to \n",
    "driver.switch_to.window(driver.window_handles[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 超链接"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                超链接              \n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"超链接\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "选择其他公众号\n"
     ]
    }
   ],
   "source": [
    "# 点 选择其他公众号\n",
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"选择其他公众号\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.find_element_by_xpath('//form//div[@class=\"inner_link_account_area\"]//input[@class=\"weui-desktop-form__input\"]').clear()\n",
    "driver.find_element_by_xpath('//form//div[@class=\"inner_link_account_area\"]//input[@class=\"weui-desktop-form__input\"]').send_keys(公众号)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-icon weui-desktop-icon__inputSearch weui-desktop-icon__small\"><!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <svg width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" xmlns=\"http://www.w3.org/2000/svg\"><path d=\"M11.33 10.007l4.273 4.273a.502.502 0 0 1 .005.709l-.585.584a.499.499 0 0 1-.709-.004L10.046 11.3a6.278 6.278 0 1 1 1.284-1.294zm.012-3.729a5.063 5.063 0 1 0-10.127 0 5.063 5.063 0 0 0 10.127 0z\"></path></svg> <!----> <!----> <!----> <!----></div>\n"
     ]
    }
   ],
   "source": [
    "# 点放大镜搜\n",
    "element = driver.find_element_by_xpath('//button[@class=\"weui-desktop-icon-btn weui-desktop-search__btn\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/vxbU9bvk15pZbxH9pS6KxPwRpTYxnicXqwyOPshssgKseVicN027vEYjWZ7cshVjfs0rBJGX9iaL950DGmowL5iapA/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">Softime</strong> <i class=\"inner_link_account_wechat\">微信号：未设置</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/y4xpjAianWEsF9Nkib0oy2RRz1ibh2CQiao8hIVhqT8jRqmpY1sBAuzDQDJdft4jLfgibickld4mLn00mEP8xybeVvQA/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">柔软时光softime</strong> <i class=\"inner_link_account_wechat\">微信号：未设置</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/fMplnTT0RVyUP49DQazY5hKArkWAXRpm3Lxd7rkX7qK149Zd4tT2HdNO18U2PYjIL7B5wfqWt7VNtictIUA6s4w/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">Oceans of Time</strong> <i class=\"inner_link_account_wechat\">微信号：未设置</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/ebDsk58ovP72FqGEibDgfNTKHqY1jjFxCHMlBB3w6DibDgrlbD7GuLzHWpzFsol3Aj0IGBqV2LXf77q5RpDKvib3w/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">SOFTiME轻松一刻</strong> <i class=\"inner_link_account_wechat\">微信号：Softime-wx</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li>\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "公众号SERP = main_content\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 解析\n",
    "root = fromstring(公众号SERP) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "主 = root.xpath('//li[@class=\"inner_link_account_item\"]')\n",
    "\n",
    "account_list = []\n",
    "for e in 主:\n",
    "    account_nickname = e.xpath('./div/strong[@class=\"inner_link_account_nickname\"]')[0].text\n",
    "    account_wechat = e.xpath('./div/i[@class=\"inner_link_account_wechat\"]')[0].text\n",
    "    account_img = e.xpath('./div/img/@src')[0]\n",
    "    account = {\"nickname\": account_nickname, \"wechat\": account_wechat, \"img\": account_img,}\n",
    "    account_list.append(account)\n",
    "\n",
    "df_account = pd.DataFrame(account_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nickname</th>\n",
       "      <th>wechat</th>\n",
       "      <th>img</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Softime</td>\n",
       "      <td>微信号：未设置</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/vxbU9bvk15pZbxH...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>柔软时光softime</td>\n",
       "      <td>微信号：未设置</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/y4xpjAianWEsF9N...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Oceans of Time</td>\n",
       "      <td>微信号：未设置</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/fMplnTT0RVyUP49...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>SOFTiME轻松一刻</td>\n",
       "      <td>微信号：Softime-wx</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/ebDsk58ovP72FqG...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         nickname          wechat  \\\n",
       "0         Softime         微信号：未设置   \n",
       "1     柔软时光softime         微信号：未设置   \n",
       "2  Oceans of Time         微信号：未设置   \n",
       "3     SOFTiME轻松一刻  微信号：Softime-wx   \n",
       "\n",
       "                                                 img  \n",
       "0  http://mmbiz.qpic.cn/mmbiz_png/vxbU9bvk15pZbxH...  \n",
       "1  http://mmbiz.qpic.cn/mmbiz_png/y4xpjAianWEsF9N...  \n",
       "2  http://mmbiz.qpic.cn/mmbiz_png/fMplnTT0RVyUP49...  \n",
       "3  http://mmbiz.qpic.cn/mmbiz_png/ebDsk58ovP72FqG...  "
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_account"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/vxbU9bvk15pZbxH9pS6KxPwRpTYxnicXqwyOPshssgKseVicN027vEYjWZ7cshVjfs0rBJGX9iaL950DGmowL5iapA/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">Softime</strong> <i class=\"inner_link_account_wechat\">微信号：未设置</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div>\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]/li')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n跳转_input = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/input\\')\\n跳转_a = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/a\\')\\n跳转_input.clear()\\n跳转_input.send_keys(2)\\n跳转_a.click()\\n'"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 跳转testing\n",
    "'''\n",
    "跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "跳转_input.clear()\n",
    "跳转_input.send_keys(2)\n",
    "跳转_a.click()\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 4]\n",
      "False\n"
     ]
    }
   ],
   "source": [
    "# 跳转上限\n",
    "l_e = driver.find_elements_by_xpath('//label[@class=\"weui-desktop-pagination__num\"]')\n",
    "l_e_int  = [int(x.text) for x in l_e] \n",
    "print (l_e_int)\n",
    "print (l_e_int[0]==l_e_int[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 2, 3, 4]\n"
     ]
    }
   ],
   "source": [
    "pages = list(range(l_e_int[0],l_e_int[-1]+1 ))\n",
    "#print(pages[0:2])\n",
    "pages = list(range(1,l_e_int[-1]+1 ))\n",
    "print(pages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 循环/遍历"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# global varialbes \n",
    "html_raw = dict()\n",
    "main_content =\"\"\n",
    "element = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_pages (pages):\n",
    "    for p in pages:\n",
    "        print (p,end='\\t')\n",
    "\n",
    "        跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "        跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "        跳转_input.clear()\n",
    "        跳转_input.send_keys(p)\n",
    "        跳转_a.click()\n",
    "\n",
    "        time.sleep(45+120*random())\n",
    "\n",
    "        element = driver.find_element_by_xpath('//div[@class=\"inner_link_article_list\"]')\n",
    "        main_content = element.get_attribute('innerHTML')\n",
    "        #print(main_content)\n",
    "        html_raw[p] = main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "process_pages(pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame([html_raw]).T\n",
    "df.columns = [\"html_snippets\"]\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store html_raw\n",
    "import pickle \n",
    "filehandler = open(\"html_raw\", 'wb') \n",
    "pickle.dump(html_raw, filehandler)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_out = df[~df.duplicated()]\n",
    "print (len(df_out))\n",
    "df[df.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try_again = list(df[df.duplicated()].index)\n",
    "print(try_again)\n",
    "try_again = try_again + list (set(pages).difference(set(df.index.values)))\n",
    "try_again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 暂存档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = fn [\"output\"] [\"公众号_htm_snippets\"] \n",
    "df_out.to_csv(filename.format(公众号=公众号), sep=\"\\t\", encoding=\"utf8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5,9,5,3,"
     ]
    }
   ],
   "source": [
    "def parse_html_snippets(_snippet_):\n",
    "    root = fromstring(_snippet_) \n",
    "    title = [x.text for x in root.xpath('//div[@class=\"inner_link_article_title\"]')]\n",
    "    create_time = [x.text for x in root.xpath('//div[@class=\"inner_link_article_date\"]')]\n",
    "    link = [x for x in root.xpath('//a/@href')]\n",
    "    _df_ = pd.DataFrame({\"title\":title, \"create_time\": create_time, \"link\":link})\n",
    "    return(_df_)\n",
    "    \n",
    "l_df = []\n",
    "for p in pages:\n",
    "    _df_ = parse_html_snippets(df.loc[p,\"html_snippets\"])\n",
    "    print (len(_df_), end=\",\")\n",
    "    l_df.append(_df_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>与自己和解，生活会快乐很多</td>\n",
       "      <td>2020-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>人生有多少个十年</td>\n",
       "      <td>2020-04-27</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>受够了，我想开学了</td>\n",
       "      <td>2020-04-19</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>四月你好|你才是疲惫生活中的梦想与糖</td>\n",
       "      <td>2020-04-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>00后，应该怎样面对这个社会</td>\n",
       "      <td>2020-04-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>成年人都学会了悄无声息</td>\n",
       "      <td>2020-03-29</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>这人世间的烟火，我也曾努力看过</td>\n",
       "      <td>2020-03-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>白色情人节|情侣之间，什么最重要？</td>\n",
       "      <td>2020-03-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>你还在问什么时候开学吗？</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>十年剧情终落幕，有情人终成眷属</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>白天的理性总在夜色的沉默里翻了船</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>无论怎样，年轻的你并不丑</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>白天的理性总在夜色的沉默里翻了船</td>\n",
       "      <td>2020-02-27</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>无论怎样，年轻的你并不丑</td>\n",
       "      <td>2020-02-26</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>十年剧情终落幕，有情人终成眷属</td>\n",
       "      <td>2020-02-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>怕麻烦，会让你的人生变得越来越平庸</td>\n",
       "      <td>2020-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>再见，过去三年  你好，未来四年</td>\n",
       "      <td>2018-10-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>余生很长，我只想和你走一趟</td>\n",
       "      <td>2018-09-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>有个男人，他很爱你</td>\n",
       "      <td>2018-08-30</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 title create_time  \\\n",
       "0        与自己和解，生活会快乐很多  2020-05-06   \n",
       "1             人生有多少个十年  2020-04-27   \n",
       "2            受够了，我想开学了  2020-04-19   \n",
       "3   四月你好|你才是疲惫生活中的梦想与糖  2020-04-12   \n",
       "4       00后，应该怎样面对这个社会  2020-04-05   \n",
       "5          成年人都学会了悄无声息  2020-03-29   \n",
       "6      这人世间的烟火，我也曾努力看过  2020-03-22   \n",
       "7    白色情人节|情侣之间，什么最重要？  2020-03-14   \n",
       "8         你还在问什么时候开学吗？  2020-03-08   \n",
       "9      十年剧情终落幕，有情人终成眷属  2020-03-08   \n",
       "10         青春是过往，年少是序章  2020-03-08   \n",
       "11         青春是过往，年少是序章  2020-03-04   \n",
       "12    白天的理性总在夜色的沉默里翻了船  2020-03-04   \n",
       "13        无论怎样，年轻的你并不丑  2020-03-04   \n",
       "14    白天的理性总在夜色的沉默里翻了船  2020-02-27   \n",
       "15        无论怎样，年轻的你并不丑  2020-02-26   \n",
       "16     十年剧情终落幕，有情人终成眷属  2020-02-14   \n",
       "17   怕麻烦，会让你的人生变得越来越平庸  2020-02-09   \n",
       "18    再见，过去三年  你好，未来四年  2018-10-05   \n",
       "19       余生很长，我只想和你走一趟  2018-09-08   \n",
       "20           有个男人，他很爱你  2018-08-30   \n",
       "\n",
       "                                                 link  \n",
       "0   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "1   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "2   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "3   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "4   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "5   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "6   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "7   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "8   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "9   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "10  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "11  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "12  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "13  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "14  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "15  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "16  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "17  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "18  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "19  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "20  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  "
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Concatenate the frames collected in l_df and renumber the rows from 0\n",
    "df_url_out = pd.concat(l_df).reset_index(drop=True)\n",
    "df_url_out.loc[0:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>怕麻烦，会让你的人生变得越来越平庸</td>\n",
       "      <td>2020-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>再见，过去三年  你好，未来四年</td>\n",
       "      <td>2018-10-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>余生很长，我只想和你走一趟</td>\n",
       "      <td>2018-09-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>有个男人，他很爱你</td>\n",
       "      <td>2018-08-30</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>愿你我被世界善待，被生活热爱</td>\n",
       "      <td>2018-08-28</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                title create_time  \\\n",
       "17  怕麻烦，会让你的人生变得越来越平庸  2020-02-09   \n",
       "18   再见，过去三年  你好，未来四年  2018-10-05   \n",
       "19      余生很长，我只想和你走一趟  2018-09-08   \n",
       "20          有个男人，他很爱你  2018-08-30   \n",
       "21     愿你我被世界善待，被生活热爱  2018-08-28   \n",
       "\n",
       "                                                 link  \n",
       "17  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "18  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "19  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "20  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "21  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  "
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_url_out.tail(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>value</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>与自己和解，生活会快乐很多</td>\n",
       "      <td>2020-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>人生有多少个十年</td>\n",
       "      <td>2020-04-27</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>受够了，我想开学了</td>\n",
       "      <td>2020-04-19</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>00后，应该怎样面对这个社会</td>\n",
       "      <td>2020-04-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>你还在问什么时候开学吗？</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                title create_time  \\\n",
       "value                               \n",
       "0       与自己和解，生活会快乐很多  2020-05-06   \n",
       "1            人生有多少个十年  2020-04-27   \n",
       "2           受够了，我想开学了  2020-04-19   \n",
       "4      00后，应该怎样面对这个社会  2020-04-05   \n",
       "8        你还在问什么时候开学吗？  2020-03-08   \n",
       "10        青春是过往，年少是序章  2020-03-08   \n",
       "11        青春是过往，年少是序章  2020-03-04   \n",
       "\n",
       "                                                    link  \n",
       "value                                                     \n",
       "0      http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "1      http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "2      http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "4      http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "8      http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "10     http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "11     http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  "
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
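   ],
   "source": [
    "# Tag each title with a keyword it contains.\n",
    "# The empty string matches every title, so it acts as the default (untagged) label.\n",
    "tagging_list = [\"\", \"世界\", \"热爱\", \"善待\", \"爱\", \"余生\", \"理性\", \"沉默\", \"年轻\",\n",
    "                \"麻烦\", \"平庸\",\n",
    "                \"有情人终成眷属\", \"再见\", \"你好\", \"人世间的烟火\", \"努力\",\n",
    "                \"情侣\", \"悄无声息\"]  # overwritable\n",
    "\n",
    "v_v_list = []\n",
    "\n",
    "for tag in tagging_list:\n",
    "    # regex=False treats the tag as a literal substring rather than a pattern\n",
    "    index_list = df_url_out[df_url_out.title.str.contains(tag, regex=False)].index.tolist()\n",
    "    # One row per matching title: index = row label (\"value\"), column \"variable\" = tag\n",
    "    v_v_pairs = pd.DataFrame({tag: index_list}).melt().set_index(\"value\")\n",
    "    v_v_list.append(v_v_pairs)\n",
    "\n",
    "# update() overwrites in place, so a later matching tag replaces an earlier one\n",
    "df_cat = v_v_list[0]\n",
    "for d in v_v_list:\n",
    "    df_cat.update(d)\n",
    "\n",
    "# Titles that are still untagged\n",
    "df_url_out.loc[df_cat.query('variable==\"\"').index]"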
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "df_url_out.loc[53].link"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [title, create_time, link]\n",
       "Index: []"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Rows that exactly duplicate an earlier row (all columns equal)\n",
    "df_url_out[df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>与自己和解，生活会快乐很多</td>\n",
       "      <td>2020-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>人生有多少个十年</td>\n",
       "      <td>2020-04-27</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>受够了，我想开学了</td>\n",
       "      <td>2020-04-19</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>四月你好|你才是疲惫生活中的梦想与糖</td>\n",
       "      <td>2020-04-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>00后，应该怎样面对这个社会</td>\n",
       "      <td>2020-04-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>成年人都学会了悄无声息</td>\n",
       "      <td>2020-03-29</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>这人世间的烟火，我也曾努力看过</td>\n",
       "      <td>2020-03-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>白色情人节|情侣之间，什么最重要？</td>\n",
       "      <td>2020-03-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>你还在问什么时候开学吗？</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>十年剧情终落幕，有情人终成眷属</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>白天的理性总在夜色的沉默里翻了船</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>无论怎样，年轻的你并不丑</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>白天的理性总在夜色的沉默里翻了船</td>\n",
       "      <td>2020-02-27</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>无论怎样，年轻的你并不丑</td>\n",
       "      <td>2020-02-26</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>十年剧情终落幕，有情人终成眷属</td>\n",
       "      <td>2020-02-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>怕麻烦，会让你的人生变得越来越平庸</td>\n",
       "      <td>2020-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>再见，过去三年  你好，未来四年</td>\n",
       "      <td>2018-10-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>余生很长，我只想和你走一趟</td>\n",
       "      <td>2018-09-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>有个男人，他很爱你</td>\n",
       "      <td>2018-08-30</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>愿你我被世界善待，被生活热爱</td>\n",
       "      <td>2018-08-28</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 title create_time  \\\n",
       "0        与自己和解，生活会快乐很多  2020-05-06   \n",
       "1             人生有多少个十年  2020-04-27   \n",
       "2            受够了，我想开学了  2020-04-19   \n",
       "3   四月你好|你才是疲惫生活中的梦想与糖  2020-04-12   \n",
       "4       00后，应该怎样面对这个社会  2020-04-05   \n",
       "5          成年人都学会了悄无声息  2020-03-29   \n",
       "6      这人世间的烟火，我也曾努力看过  2020-03-22   \n",
       "7    白色情人节|情侣之间，什么最重要？  2020-03-14   \n",
       "8         你还在问什么时候开学吗？  2020-03-08   \n",
       "9      十年剧情终落幕，有情人终成眷属  2020-03-08   \n",
       "10         青春是过往，年少是序章  2020-03-08   \n",
       "11         青春是过往，年少是序章  2020-03-04   \n",
       "12    白天的理性总在夜色的沉默里翻了船  2020-03-04   \n",
       "13        无论怎样，年轻的你并不丑  2020-03-04   \n",
       "14    白天的理性总在夜色的沉默里翻了船  2020-02-27   \n",
       "15        无论怎样，年轻的你并不丑  2020-02-26   \n",
       "16     十年剧情终落幕，有情人终成眷属  2020-02-14   \n",
       "17   怕麻烦，会让你的人生变得越来越平庸  2020-02-09   \n",
       "18    再见，过去三年  你好，未来四年  2018-10-05   \n",
       "19       余生很长，我只想和你走一趟  2018-09-08   \n",
       "20           有个男人，他很爱你  2018-08-30   \n",
       "21      愿你我被世界善待，被生活热爱  2018-08-28   \n",
       "\n",
       "                                                 link  \n",
       "0   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "1   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "2   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "3   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "4   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "5   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "6   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "7   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "8   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "9   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "10  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "11  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "12  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "13  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "14  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "15  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "16  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "17  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "18  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "19  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "20  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  \n",
       "21  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  "
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Rows that remain once exact duplicates are excluded\n",
    "df_url_out[~df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "      <th>variable</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>与自己和解，生活会快乐很多</td>\n",
       "      <td>2020-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>人生有多少个十年</td>\n",
       "      <td>2020-04-27</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>受够了，我想开学了</td>\n",
       "      <td>2020-04-19</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>四月你好|你才是疲惫生活中的梦想与糖</td>\n",
       "      <td>2020-04-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>你好</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>00后，应该怎样面对这个社会</td>\n",
       "      <td>2020-04-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>成年人都学会了悄无声息</td>\n",
       "      <td>2020-03-29</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>悄无声息</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>这人世间的烟火，我也曾努力看过</td>\n",
       "      <td>2020-03-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>努力</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>白色情人节|情侣之间，什么最重要？</td>\n",
       "      <td>2020-03-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>情侣</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>你还在问什么时候开学吗？</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>十年剧情终落幕，有情人终成眷属</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>有情人终成眷属</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>白天的理性总在夜色的沉默里翻了船</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>沉默</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>无论怎样，年轻的你并不丑</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>年轻</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>白天的理性总在夜色的沉默里翻了船</td>\n",
       "      <td>2020-02-27</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>沉默</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>无论怎样，年轻的你并不丑</td>\n",
       "      <td>2020-02-26</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>年轻</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>十年剧情终落幕，有情人终成眷属</td>\n",
       "      <td>2020-02-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>有情人终成眷属</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>怕麻烦，会让你的人生变得越来越平庸</td>\n",
       "      <td>2020-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>平庸</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>再见，过去三年  你好，未来四年</td>\n",
       "      <td>2018-10-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>你好</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>余生很长，我只想和你走一趟</td>\n",
       "      <td>2018-09-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>余生</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>有个男人，他很爱你</td>\n",
       "      <td>2018-08-30</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>爱</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>愿你我被世界善待，被生活热爱</td>\n",
       "      <td>2018-08-28</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>爱</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 title create_time  \\\n",
       "0        与自己和解，生活会快乐很多  2020-05-06   \n",
       "1             人生有多少个十年  2020-04-27   \n",
       "2            受够了，我想开学了  2020-04-19   \n",
       "3   四月你好|你才是疲惫生活中的梦想与糖  2020-04-12   \n",
       "4       00后，应该怎样面对这个社会  2020-04-05   \n",
       "5          成年人都学会了悄无声息  2020-03-29   \n",
       "6      这人世间的烟火，我也曾努力看过  2020-03-22   \n",
       "7    白色情人节|情侣之间，什么最重要？  2020-03-14   \n",
       "8         你还在问什么时候开学吗？  2020-03-08   \n",
       "9      十年剧情终落幕，有情人终成眷属  2020-03-08   \n",
       "10         青春是过往，年少是序章  2020-03-08   \n",
       "11         青春是过往，年少是序章  2020-03-04   \n",
       "12    白天的理性总在夜色的沉默里翻了船  2020-03-04   \n",
       "13        无论怎样，年轻的你并不丑  2020-03-04   \n",
       "14    白天的理性总在夜色的沉默里翻了船  2020-02-27   \n",
       "15        无论怎样，年轻的你并不丑  2020-02-26   \n",
       "16     十年剧情终落幕，有情人终成眷属  2020-02-14   \n",
       "17   怕麻烦，会让你的人生变得越来越平庸  2020-02-09   \n",
       "18    再见，过去三年  你好，未来四年  2018-10-05   \n",
       "19       余生很长，我只想和你走一趟  2018-09-08   \n",
       "20           有个男人，他很爱你  2018-08-30   \n",
       "21      愿你我被世界善待，被生活热爱  2018-08-28   \n",
       "\n",
       "                                                 link variable  \n",
       "0   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "1   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "2   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "3   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       你好  \n",
       "4   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "5   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     悄无声息  \n",
       "6   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       努力  \n",
       "7   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       情侣  \n",
       "8   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "9   http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  有情人终成眷属  \n",
       "10  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "11  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "12  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       沉默  \n",
       "13  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       年轻  \n",
       "14  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       沉默  \n",
       "15  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       年轻  \n",
       "16  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...  有情人终成眷属  \n",
       "17  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       平庸  \n",
       "18  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       你好  \n",
       "19  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...       余生  \n",
       "20  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...        爱  \n",
       "21  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...        爱  "
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
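    "# Join the keyword category onto the URL table; empty or missing values are filled with \"无法分类\" (unclassified).\n",
    "df_o = df_url_out.join(df_cat).replace(\"\", np.nan).fillna(\"无法分类\")\n",
    "df_o"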
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "      <th>variable</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>青春是过往，年少是序章</td>\n",
       "      <td>2020-03-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          title create_time  \\\n",
       "10  青春是过往，年少是序章  2020-03-08   \n",
       "11  青春是过往，年少是序章  2020-03-04   \n",
       "\n",
       "                                                 link variable  \n",
       "10  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  \n",
       "11  http://mp.weixin.qq.com/s?__biz=MzU3NTcwNDQzMw...     无法分类  "
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
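    "# Spot-check titles containing \"青春\" (youth) to see whether an article is listed more than once.\n",
    "df_o[df_o.title.str.contains(\"青春\")]"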
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>variable</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>无法分类</th>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>你好</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>年轻</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>有情人终成眷属</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>沉默</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>爱</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>余生</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>努力</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>平庸</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>悄无声息</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>情侣</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          title\n",
       "variable       \n",
       "无法分类          7\n",
       "你好            2\n",
       "年轻            2\n",
       "有情人终成眷属       2\n",
       "沉默            2\n",
       "爱             2\n",
       "余生            1\n",
       "努力            1\n",
       "平庸            1\n",
       "悄无声息          1\n",
       "情侣            1"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count articles per category and sort from most to least frequent.\n",
    "df_stats = df_o.groupby(by=\"variable\").agg({\"title\":\"count\"}).sort_values(by=\"title\", ascending=False)\n",
    "df_stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
